{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#标签特征二元化"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这个主题中，我们将用另一种方式来演示分类变量。有些时候只有一两个分类特征是重要的，这时就要避免多余的维度，如果有多个分类变量就有可能会出现这些多余的维度。\n",
    "\n",
    "<!-- TEASER_END -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Getting ready"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "处理分类变量还有另一种方法，不需要通过`OneHotEncoder`，我们可以用`LabelBinarizer`。这是一个阈值与分类变量组合的方法。演示其用法之前，让我们加载`iris`数据集："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets as d\n",
    "iris = d.load_iris()\n",
    "target = iris.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How to do it..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入`LabelBinarizer()`创建一个对象："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelBinarizer\n",
    "label_binarizer = LabelBinarizer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，将因变量的值转换成一个新的特征向量："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "new_target = label_binarizer.fit_transform(target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "让我们看看`new_target`和`label_binarizer`对象的结果："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 3)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 0, 0],\n",
       "       [1, 0, 0],\n",
       "       [1, 0, 0],\n",
       "       [1, 0, 0],\n",
       "       [1, 0, 0]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_target[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 1],\n",
       "       [0, 0, 1],\n",
       "       [0, 0, 1],\n",
       "       [0, 0, 1],\n",
       "       [0, 0, 1]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_target[-5:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 2])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_binarizer.classes_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How it works..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`iris`的因变量基数为`3`，就是说有三种值。当`LabelBinarizer`将$N \\times 1$向量转换成$N \\times C$矩阵时，$C$就是$N \\times 1$向量的基数。需要注意的是，当`label_binarizer`处理因变量之后，再转换基数以外的值都是`[0,0,0]`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 0]])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_binarizer.transform([4])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##There's more..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "0和1并不一定都是表示因变量中的阳性和阴性实例。例如，如果我们需要用`1000`表示阳性值，用`-1000`表示阴性值，我们可以用`label_binarizer`处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1000, -1000, -1000],\n",
       "       [ 1000, -1000, -1000],\n",
       "       [ 1000, -1000, -1000],\n",
       "       [ 1000, -1000, -1000],\n",
       "       [ 1000, -1000, -1000]])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_binarizer = LabelBinarizer(neg_label=-1000, pos_label=1000)\n",
    "label_binarizer.fit_transform(target)[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">阳性和阴性值的唯一限制是，它们必须为整数。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
